Conversation

@eddierichter-amd
Collaborator

The initial implementation of summarization used the .generate() method in the ChatSDK, which bypassed sending the previously stored messages to the LLM. It therefore had to send the entire prompt to the agent in a slightly different format, which bypassed the KV cache and forced the whole prompt to be reprocessed. This resulted in very high TTFT for summarization. This change reduces the TTFT of summarization to be similar to other requests.
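
Roughly, the change looks like this (a minimal sketch only; the old prompt-assembly helper and the exact response handling are assumptions, not the actual ChatSDK code):

# Old path (sketch): assemble one standalone prompt and call generate().
# Nothing from the stored history is reused, so the KV cache from earlier
# turns cannot be hit and the whole prompt must be processed -> high TTFT.
full_prompt = build_standalone_summary_prompt(self.chat_history)  # hypothetical helper
summary = self.llm_client.generate(full_prompt)

# New path (sketch): go through the normal send() path. The stored history
# is formatted exactly like a regular turn, so the KV pairs for the earlier
# messages are reused and only the new summary prompt needs processing.
chat_response = self.send(
    summary_prompt,
    max_tokens=min(self.config.max_tokens, 2048),
    timeout=1200,
)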

@eddierichter-amd eddierichter-amd force-pushed the eddie/summarization-kv-cache-eddie-fork branch from 4e2b5ae to 7a33fa9 on January 9, 2026 19:14
@kovtcharov-amd
Collaborator

@claude review this PR.

@github-actions github-actions bot added the chat Chat SDK changes label Jan 12, 2026
@github-actions

github-actions bot commented Jan 12, 2026

Claude finished @kovtcharov-amd's task


PR Review: KV Cache for Summarization

  • Read the PR diff and understand the changes
  • Review code quality and correctness
  • Check for potential issues
  • Provide review feedback

Overall Assessment: 🟢 Looks good with minor suggestions

The PR correctly addresses the TTFT issue by switching from llm_client.generate() to send(), which reuses the existing chat history formatting path and enables KV cache utilization.

What's Good

  1. KV cache optimization - Using send() with existing history allows the LLM to leverage cached KV pairs from previous turns
  2. History preservation on failure - The backup/restore pattern (original_history) properly handles error cases
  3. Consistent formatting - History formatting now follows the same path as normal chat turns

Issues Found

🟡 History pollution - The send() call at line 831 will append BOTH the summary prompt as a user message AND the summary response to chat_history (see sdk.py:362 and sdk.py:391). Then the code clears history and adds just the summary at lines 850-853. This works but creates an intermediate state where the summarization messages briefly exist in history.

Suggestion: Consider using no_history=True parameter in the send() call to avoid polluting history, then manually manage what gets stored:

chat_response = self.send(
    summary_prompt,
    no_history=True,  # Don't add to history
    max_tokens=min(self.config.max_tokens, 2048),
    timeout=1200,
)

This would be cleaner since you're clearing and replacing history anyway. However, using no_history=True would also prevent the existing history context from being sent to the LLM, so the current approach may actually be intentional to ensure the model sees the conversation when summarizing. Please confirm this is the intended behavior.

🟢 Minor - Line 843: The if chat_response check is good defensive coding, but send() raises on error rather than returning None, so the empty string fallback may never trigger. Not a problem, just redundant.

Verification

The logic flow is:

  1. Backup history → 2. Call send() (adds prompt+response to history) → 3. Clear history → 4. Add only summary

This correctly leaves only the summary in history after completion.
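
Sketched in code (illustrative only, based on the flow above; names other than send(), chat_history, and original_history are assumptions rather than the actual sdk.py implementation):

original_history = list(self.chat_history)  # 1. back up history
try:
    # 2. send() formats the stored history like a normal turn (enabling KV
    #    cache reuse) and appends both the prompt and the response to history.
    chat_response = self.send(
        summary_prompt,
        max_tokens=min(self.config.max_tokens, 2048),
        timeout=1200,
    )
    summary = chat_response if chat_response else ""  # mirrors the check at line 843
except Exception:
    self.chat_history = original_history  # restore untouched history on failure
    raise
else:
    # 3. clear history, then 4. keep only the summary going forward
    # (the message shape shown here is hypothetical).
    self.chat_history = [{"role": "assistant", "content": summary}]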

Approve with minor suggestion - The code is functional and achieves the stated goal. The no_history=True suggestion is optional depending on whether you want the LLM to see prior context during summarization.


@eddierichter-amd
Collaborator Author

@claude your guess at the end is correct: if we set no_history=True we would not have the history and we wouldn't be able to summarize the context. Are there any other medium-priority issues with this PR?

@github-actions

github-actions bot commented Jan 13, 2026

Claude finished @eddierichter-amd's task


Re-reviewing PR for Medium Priority Issues

  • Read the PR diff
  • Check for any medium priority concerns
  • Provide findings

Thanks for confirming the no_history design decision! That makes sense.

Additional Findings:

After re-examining the PR, I don't see any other medium priority issues. The changes are clean and correct. Here's my assessment of the remaining minor items:

🟢 Minor (already mentioned): Line 843 has a defensive check if chat_response else "" that's technically redundant since send() raises exceptions rather than returning None. However, it's harmless defensive coding.

🟢 Minor: The prompt at lines 823-824 now says "You have full access to the prior conversation history above" - this wording is clear and appropriate for the new flow.

No other issues found. The implementation correctly:

  • Backs up history before the operation
  • Restores history on failure paths
  • Clears and replaces with summary on success
  • Uses appropriate timeout and max_tokens constraints

The PR is good to merge from a code quality perspective.


@eddierichter-amd
Collaborator Author

@kovtcharov-amd this has passed Claude's review. Let me know what you think. This is a really nice quality-of-life improvement, as it reduces the time for summarization from over a minute to a couple of seconds.

@eddierichter-amd eddierichter-amd added this pull request to the merge queue Jan 14, 2026
Merged via the queue into amd:main with commit 5783754 Jan 14, 2026
64 of 70 checks passed
@eddierichter-amd eddierichter-amd deleted the eddie/summarization-kv-cache-eddie-fork branch January 14, 2026 18:28